Advanced models
1 Getting started
1.1 Required packages
library(highcharter) # For interactive plots
library(WDI) # For World Bank data
library(openxlsx) # A cool package to deal with Excel files/formats
library(quantmod) # Package for financial data extraction
library(tsibble) # TS with dataframes framework
library(fable) # Package for time-series models & predictions
library(feasts) # Package for time-series analysis
library(forecast) # Another package for TS (forecasting)
library(tseries) # Simple package for time-series tests
library(plotly) # For 3D graphs
library(mvtnorm) # For multivariate Gaussian distributions
library(crypto2) # For crypto data
library(ggsci) # For color palettes
library(rugarch) # For GARCH models
library(fGarch) # For GARCH models
library(vars) # For VAR (Vector AutoRegressive) models
library(tidyverse) # THE library for data science; at the end to override other functions
set.seed(42) # Random seed
impute <- function(v, n = 6){ # Imputation: forward-fill NAs (up to n passes)
  for(j in 1:n){
    ind <- which(is.na(v))
    if(length(ind) > 0){
      if(ind[1] == 1){ind <- ind[-1]} # The first observation cannot be filled
      v[ind] <- v[ind-1]
    }
  }
  return(v)
}
1.2 Fetching some (crypto) data
coins <- crypto_list()
c_info <- crypto_info(coin_list = coins, limit = 30)
❯ Scraping crypto info
❯ Processing crypto info
coin_names <- c("Bitcoin", "Ethereum", "Tether USDt", "BNB", "USDC", "Solana")
coin_hist <- crypto_history(coins |> dplyr::filter(name %in% coin_names),
start_date = "20170101",
                              end_date = "20251206")
❯ Scraping historical crypto data
❯ Processing historical crypto data
Coin USDC does not have data available! Cont to next coin.
Coin Solana does not have data available! Cont to next coin.
Coin Solana does not have data available! Cont to next coin.
coin_hist <- coin_hist |> # Timestamps are at midnight, hence we add a day.
dplyr::mutate(date = 1 + as.Date(as.POSIXct(timestamp, origin="1970-01-01")))
coin_hist <- coin_hist |>
group_by(name) |>
mutate(return = close / lag(close) - 1 ) |>
  ungroup()
2 ARCH-type models
2.1 Realized volatility
Then, let’s have a look at volatility. It’s measured as
\[v_t=\sqrt{\frac{1}{T-1}\sum_{s=1}^T(r_{t-s}-\bar{r})^2},\] where \(\bar{r}\) is the mean return in the sample of \(T\) observations. Below, we use an optimized (C++-based) function to compute it as a rolling standard deviation over 20 days.
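To make the formula concrete, here is a hand-rolled version of the trailing 20-day rolling standard deviation, on simulated toy returns (illustration only, not the crypto data):

```r
# Trailing-window realized volatility, computed by hand (toy simulated returns)
set.seed(1)
ret <- rnorm(100, mean = 0, sd = 0.02)

roll_sd_manual <- function(r, n = 20) {
  sapply(seq_along(r), function(t) {
    if (t < n) NA else sd(r[(t - n + 1):t]) # sample sd of the last n returns
  })
}

vol <- roll_sd_manual(ret, n = 20) # first 19 values are NA
```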
library(RcppRoll)
coin_hist <- coin_hist |>
group_by(name) |>
na.omit() |>
  mutate(real_vol_20 = roll_sd(return, n = 20, fill = NA, na.rm = T))
coin_hist |>
filter(name == "Bitcoin") |>
pivot_longer(all_of(c("close", "real_vol_20")), names_to = "series", values_to = "value") |>
ggplot(aes(x = date, y = value, color = series)) + geom_line() + theme_bw() +
  facet_wrap(vars(series), nrow = 2, scales = "free") # + scale_y_log10()
Warning: Removed 19 rows containing missing values or values outside the scale range
(`geom_line()`).
Now let’s have a look at the autocorrelation of the volatility: it’s highly persistent.
crypto_acf <- coin_hist |>
na.omit() %>%
split(.$name) %>%
map(~acf(.$real_vol_20, plot = F)) %>%
map_dfr(
~data.frame(lag = .$lag,
acf = .$acf,
ci = qnorm(0.975)/sqrt(.$n.used)
), .id = "name")
crypto_acf |>
filter(lag > 0, lag < 15) |>
ggplot(aes(x = lag, y = acf, fill = name)) +
geom_col(position = "dodge", alpha = 0.8) + theme_bw() +
  scale_fill_viridis_d(option = "magma", end = 0.9, begin = 0.1)
Hence, in practice, simple autoregressive models can be too limited because some variables are not exactly stationary.
The realized volatility is based on past returns. But markets (equity, mostly) have another very useful indicator: the VIX. The VIX is different in nature because it is forward-looking: it is computed from option prices (taking their implied volatilities over specific horizons). Still, it is empirically linked to the traditional realized volatility.
Let’s have a look below.
min_date <- "1998-01-01" # Starting date
max_date <- "2025-11-07" # Ending date
getSymbols("^VIX", src = 'yahoo', # The data comes from Yahoo Finance
from = min_date, # Start date
to = max_date, # End date
auto.assign = TRUE,
           warnings = FALSE)
[1] "VIX"
VIX <- VIX |> as.data.frame() |>
rownames_to_column("date") |>
dplyr::mutate(date = as.Date(date))
head(VIX,7) # Have a look at the result!
        date VIX.Open VIX.High VIX.Low VIX.Close VIX.Volume VIX.Adjusted
1 1998-01-02 24.34 24.93 23.42 23.42 0 23.42
2 1998-01-05 24.11 25.02 23.02 24.36 0 24.36
3 1998-01-06 25.20 25.97 24.87 25.66 0 25.66
4 1998-01-07 26.15 27.43 25.07 25.07 0 25.07
5 1998-01-08 26.24 26.70 25.62 26.01 0 26.01
6 1998-01-09 25.79 29.35 25.69 28.69 0 28.69
7 1998-01-12 28.69 31.08 28.02 28.02 0 28.02
vix_chart <- VIX |> dplyr::select(VIX.Close)
rownames(vix_chart) <- VIX |> dplyr::pull(date)
highchart(type = "stock") %>%
hc_title(text = "Evolution of the VIX") %>%
hc_add_series(as.xts(vix_chart)) %>%
  hc_tooltip(pointFormat = '{point.y:.3f}')
We clearly see that the series exhibits moments of extreme activity and that its distribution is not the same through time. This is referred to as “volatility clustering”.
2.2 Some theory
Therefore, we also need to introduce models that allow for this type of feature. In 1982, Robert Engle proposed such a model, called ARCH for “AutoRegressive Conditional Heteroskedasticity”, which was generalized in 1986 to GARCH models. Engle obtained the Nobel Prize in economics in part for this contribution.
As we show below, GARCH models have a direct link with auto-regressive processes!
The idea is to work with the innovations \(e_t\) and to specify them a bit differently, namely as \(e_t=\sigma_t z_t\), where \(\sigma_t\) is a time-varying standard deviation and \(z_t\) is a simple white noise. What matters is that the volatility term is modelled as \[\sigma^2_t = a+\sum_{j=1}^pd_j\sigma^2_{t-j} + \sum_{k=1}^qb_ke^2_{t-k},\] hence the value of \(\sigma_t^2\) depends both on its own past and on the past of squared innovations.
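To see how this recursion generates clusters of volatility, here is a minimal GARCH(1,1) simulation in base R (the parameter values are purely illustrative):

```r
# Simulate e_t = sigma_t * z_t with sigma2_t = a + d * sigma2_{t-1} + b * e_{t-1}^2
set.seed(42)
n <- 2000
a <- 1e-5; d <- 0.85; b <- 0.10        # d + b < 1: the process is stationary
e <- sigma2 <- numeric(n)
sigma2[1] <- a / (1 - d - b)           # start at the unconditional variance
e[1] <- sqrt(sigma2[1]) * rnorm(1)
for (t in 2:n) {
  sigma2[t] <- a + d * sigma2[t - 1] + b * e[t - 1]^2
  e[t] <- sqrt(sigma2[t]) * rnorm(1)   # z_t is a Gaussian white noise
}
plot(e, type = "l")                    # calm spells alternate with bursts
```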
There have been MANY extensions to these models: T-GARCH, E-GARCH, etc. See for instance GARCH models: structure, statistical inference and financial applications.
2.3 Examples: estimation
There are many packages for GARCH estimation. We present two: {rugarch} and {fGarch}.
In {fGarch}, the GARCH(1,1) model is:
\[\sigma_t^2 = \omega + \alpha e_{t-1}^2 + \beta \sigma_{t-1}^2.\] In the output, there are some additional parameters that allow for a more general model:
- shape is the shape parameter of the conditional distribution.
- skew is the skewness parameter of the conditional distribution.
- delta is a numeric value, the exponent \(\delta\) of the variance recursion, usually fixed at 2.
\(\delta\) values other than 2 come from P-GARCH models:
\[\begin{align} \varepsilon_t & = \sigma_t z_t, \qquad z_t \overset{\text{i.i.d.}}{\sim} \text{Skew}(\text{shape},\text{skew}), \\ \sigma_t^{\delta} &= \omega + \sum_{i=1}^{q} \alpha_i\bigl(|\varepsilon_{t-i}| - \gamma_i \varepsilon_{t-i}\bigr)^{\delta} + \sum_{j=1}^{p} \beta_j\,\sigma_{t-j}^{\delta}, \end{align}\]
The distribution is the Skewed Generalized Error Distribution, which allows for more flexibility than the simple Gaussian law: it nests simpler distributions (the Gaussian and Laplace laws, for instance) while accommodating both skewness and fat tails.
fit_f <- garchFit(formula = ~garch(1,1),
data = coin_hist |>
filter(name == "Bitcoin") |>
pull(return),
trace = F)
fit_f
Title:
GARCH Modelling
Call:
garchFit(formula = ~garch(1, 1), data = pull(filter(coin_hist,
name == "Bitcoin"), return), trace = F)
Mean and Variance Equation:
data ~ garch(1, 1)
<environment: 0x147db6828>
[data = pull(filter(coin_hist, name == "Bitcoin"), return)]
Conditional Distribution:
norm
Coefficient(s):
mu omega alpha1 beta1
1.2790e-03 8.8107e-05 1.0231e-01 8.3473e-01
Std. Errors:
based on Hessian
Error Analysis:
Estimate Std. Error t value Pr(>|t|)
mu 1.279e-03 5.246e-04 2.438 0.0148 *
omega 8.811e-05 1.087e-05 8.108 4.44e-16 ***
alpha1 1.023e-01 1.184e-02 8.643 < 2e-16 ***
beta1 8.347e-01 1.577e-02 52.934 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log Likelihood:
7086.565 normalized: 1.972874
Description:
Sun Nov 30 19:04:14 2025 by user:
Now, with rugarch…
spec <- ugarchspec()
fit_r <- ugarchfit(data = coin_hist |>
filter(name == "Bitcoin") |>
na.omit() |>
pull(return),
spec = spec)
fit_r@fit$coef
           mu           ar1           ma1         omega        alpha1 
1.271558e-03 -4.037899e-01 3.681241e-01 8.677615e-05 1.002666e-01
beta1
8.370735e-01
The results are not exactly the same! The reason is that the default ugarchspec() fits an ARMA(1,1)-GARCH(1,1) model: ar1 and ma1 are the ARMA coefficients of the mean equation, not variance parameters. The variance recursion is still \[\sigma_t^2 = \omega + \alpha e_{t-1}^2 + \beta \sigma_{t-1}^2,\] and its parameters omega, alpha1 (\(\alpha\)) and beta1 (\(\beta\)) are indeed close to the {fGarch} estimates above.
With both libraries, it is possible to make predictions.
predict(fit_f, n.ahead = 5)
  meanForecast  meanError standardDeviation
1 0.001279035 0.02777110 0.02777110
2 0.001279035 0.02847428 0.02847428
3 0.001279035 0.02911777 0.02911777
4 0.001279035 0.02970811 0.02970811
5 0.001279035 0.03025082 0.03025082
ugarchforecast(fit_r, n.ahead = 3)
*------------------------------------*
* GARCH Model Forecast *
*------------------------------------*
Model: sGARCH
Horizon: 3
Roll Steps: 0
Out of Sample: 0
0-roll forecast [T0=1979-10-14]:
Series Sigma
T+1 0.0020578 0.02914
T+2 0.0009541 0.02971
T+3 0.0013998 0.03024
The values are quite close!
A cool feature of {rugarch} is that you can use forward rolling, as in actual backtests (more on that below).
3 Vector Auto-Regression
3.1 The VAR(1) model
Now we are switching to full multivariate mode, with the Vector Auto-Regression (VAR) model. It is written as:
\[X_t = \mu + A X_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d. } (0,\Sigma_\varepsilon),\] with \(X_t \in \mathbb{R}^k, \mu \in \mathbb{R}^k, A \in \mathbb{R}^{k \times k}\).
In two dimensions: \(X_t=[X_{t,1} \ X_{t,2} ]'\).
\[X_t = \mu + \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} X_{t-1} + \text{error}\]
So \(A_{11}\) measures the impact of \(X_{t-1,1}\) on \(X_{t,1}\) (itself), and similarly for \(A_{22}\) and the second component. The off-diagonal terms drive the cross-effects: \(A_{12}\) is the impact of \(X_{t-1,2}\) on \(X_{t,1}\), and vice-versa.
A few probabilistic properties.
If the VAR(1) is stable (all eigenvalues of \(A\) satisfy \(|\lambda_i(A)|<1\)), the unconditional mean exists and is given by \[\mathbb{E}[X_t] = (I - A)^{-1}\mu.\]
Next, let \(\Gamma_0 = \mathbb{V}(X_t)\); it must satisfy \[ \Gamma_0 = A \Gamma_0 A' + \Sigma_\varepsilon.\]
Next, define the lag-\(h\) autocovariance matrix: \[\Gamma_h = \text{Cov}(X_t, X_{t-h}) = \mathbb{E}[X_t X_{t-h}'],\] where the second equality holds for the centered (zero-mean) process.
Using for simplicity \(X_t = A X_{t-1} + \varepsilon_t\) (the constant terms do not affect the covariances): \[ \Gamma_h = \text{Cov}(A X_{t-1} + \varepsilon_t, X_{t-h}) = A \text{Cov}(X_{t-1}, X_{t-h}) + \underbrace{\text{Cov}(\varepsilon_t, X_{t-h})}_{=0} = A \Gamma_{h-1} \]
By recursion, we obtain the multivariate version of the AR(1): \[\Gamma_h = A^h \Gamma_0, \quad h \ge 0.\] Scaling by the variance, we get the autocorrelation: \(A^h\)!
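These moment formulas are easy to check numerically. In the sketch below (arbitrary stable \(A\) and \(\Sigma_\varepsilon\)), \(\Gamma_0\) is obtained by vectorizing \(\Gamma_0 = A\Gamma_0 A' + \Sigma_\varepsilon\), which gives \(\text{vec}(\Gamma_0) = (I - A \otimes A)^{-1}\text{vec}(\Sigma_\varepsilon)\):

```r
# Unconditional moments of a stable bivariate VAR(1)
A     <- matrix(c(0.5, 0.1, 0.2, 0.3), 2, 2) # arbitrary stable matrix
Sigma <- diag(c(1, 0.5))                     # innovation covariance
mu    <- c(0.1, 0.2)

max(abs(eigen(A)$values))                    # < 1: stability holds

m <- solve(diag(2) - A, mu)                  # E[X_t] = (I - A)^{-1} mu

# vec(Gamma_0) = (I - A kron A)^{-1} vec(Sigma)
Gamma0 <- matrix(solve(diag(4) - A %x% A, as.vector(Sigma)), 2, 2)
Gamma1 <- A %*% Gamma0                       # Gamma_h = A^h Gamma_0
```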
3.2 The VAR(p) Model
A Vector Autoregression of order \(p\), denoted VAR(\(p\)), models a \(k\)-dimensional time series vector \(\mathbf{X}_t = (X_{1,t}, \dots, X_{k,t})^\top\) as:
\[ \mathbf{X}_t = \mathbf{c} + A_1 \mathbf{X}_{t-1} + A_2 \mathbf{X}_{t-2} + \cdots + A_p \mathbf{X}_{t-p} + \mathbf{e}_t, \]
where:
- \(\mathbf{X}_t\) is a \(k \times 1\) vector of endogenous variables,
- \(\mathbf{c}\) is a \(k \times 1\) vector of intercept terms,
- \(A_i\) are \(k \times k\) coefficient matrices,
- \(\mathbf{e}_t\) is a \(k \times 1\) vector of innovations (reduced-form errors).
Error term assumptions: the innovations satisfy
\[\mathbf{e}_t \sim (0, \Sigma_e),\]
where \(\Sigma_e\) is a symmetric positive definite \(k \times k\) covariance matrix, allowing contemporaneous correlation across equations.
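As in the VAR(1) case, stability can be checked through eigenvalues, because any VAR(\(p\)) can be stacked into a VAR(1). This is the standard companion form:

\[ \begin{bmatrix} \mathbf{X}_t \\ \mathbf{X}_{t-1} \\ \vdots \\ \mathbf{X}_{t-p+1} \end{bmatrix} = \begin{bmatrix} \mathbf{c} \\ \mathbf{0} \\ \vdots \\ \mathbf{0} \end{bmatrix} + \begin{bmatrix} A_1 & A_2 & \cdots & A_p \\ I_k & 0 & \cdots & 0 \\ \vdots & \ddots & & \vdots \\ 0 & \cdots & I_k & 0 \end{bmatrix} \begin{bmatrix} \mathbf{X}_{t-1} \\ \mathbf{X}_{t-2} \\ \vdots \\ \mathbf{X}_{t-p} \end{bmatrix} + \begin{bmatrix} \mathbf{e}_t \\ \mathbf{0} \\ \vdots \\ \mathbf{0} \end{bmatrix}, \]

and the VAR(\(p\)) is stable when all eigenvalues of the companion matrix lie strictly inside the unit circle.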
3.3 Empirics
VARs are often used on macroeconomic data. Below, we fetch a sample from the World Bank API.
wb_raw <- WDI( # World Bank data
indicator = c(
"labor" = "SL.TLF.TOTL.IN", # Labor force (# individuals)
"savings_rate" = "NY.GDS.TOTL.ZS", # Savings rate (% GDP)
"inflation" = "FP.CPI.TOTL.ZG", # Inflation rate
"trade" = "NE.TRD.GNFS.ZS", # Trade as % of GDP
"pop" = "SP.POP.TOTL", # Population
"pop_growth" = "SP.POP.GROW", # Population growth
"capital_formation" = "NE.GDI.TOTL.ZS", # Gross capital formation (% GDP)
"gdp_percap" = "NY.GDP.PCAP.CD", # GDP per capita
"RD_percap" = "GB.XPD.RSDV.GD.ZS", # R&D per capita
    "educ_level" = "SE.SEC.CUAT.LO.ZS", # % pop. reaching secondary educ. level
"educ_spending" = "SE.XPD.TOTL.GD.ZS", # Education spending (%GDP)
"nb_researchers" = "SP.POP.SCIE.RD.P6", # Nb researchers per million inhab.
"debt" = "GC.DOD.TOTL.GD.ZS", # Central gov. debt (% of GDP)
"gdp" = "NY.GDP.MKTP.CD" # Gross Domestic Product (GDP)
),
extra = TRUE,
start = 1960,
end = 2024) |>
mutate(across(everything(), as.vector)) |>
select(-status, -lending, -iso2c, -iso3c) |>
filter(region != "Aggregates", income != "Aggregates") |>
arrange(country, year) |>
group_by(country) |>
mutate(across(everything(), impute)) |>
mutate(gdp_growth = gdp_percap/dplyr::lag(gdp_percap) - 1, .before = "region") |>
ungroup() |>
filter(lastupdated == max(lastupdated)) |>
  arrange(country, year)
Ok, let’s focus on the US for simplicity. Taking other countries into account would call for a Panel VAR.
To lighten the output (which can be heavy), we only focus on 3 variables.
wb_us <- wb_raw |>
filter(country == "United States") |>
select(gdp_growth, inflation, pop_growth) |>
na.omit()
fit_VAR <- VAR(wb_us, p = 1)
fit_VAR |> summary()
VAR Estimation Results:
=========================
Endogenous variables: gdp_growth, inflation, pop_growth
Deterministic variables: const
Sample size: 63
Log Likelihood: 97.776
Roots of the characteristic polynomial:
0.845 0.75 0.2559
Call:
VAR(y = wb_us, p = 1)
Estimation results for equation gdp_growth:
===========================================
gdp_growth = gdp_growth.l1 + inflation.l1 + pop_growth.l1 + const
Estimate Std. Error t value Pr(>|t|)
gdp_growth.l1 0.422576 0.141282 2.991 0.00405 **
inflation.l1 0.001643 0.001513 1.086 0.28192
pop_growth.l1 0.004779 0.012570 0.380 0.70517
const 0.020769 0.014248 1.458 0.15025
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0253 on 59 degrees of freedom
Multiple R-Squared: 0.2914, Adjusted R-squared: 0.2554
F-statistic: 8.087 on 3 and 59 DF, p-value: 0.0001344
Estimation results for equation inflation:
==========================================
inflation = gdp_growth.l1 + inflation.l1 + pop_growth.l1 + const
Estimate Std. Error t value Pr(>|t|)
gdp_growth.l1 30.60299 8.81182 3.473 0.00097 ***
inflation.l1 0.56629 0.09439 6.000 1.29e-07 ***
pop_growth.l1 -0.83768 0.78398 -1.068 0.28965
const 0.84518 0.88868 0.951 0.34546
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.578 on 59 degrees of freedom
Multiple R-Squared: 0.6864, Adjusted R-squared: 0.6705
F-statistic: 43.05 on 3 and 59 DF, p-value: 7.201e-15
Estimation results for equation pop_growth:
===========================================
pop_growth = gdp_growth.l1 + inflation.l1 + pop_growth.l1 + const
Estimate Std. Error t value Pr(>|t|)
gdp_growth.l1 0.838545 0.535011 1.567 0.122
inflation.l1 0.005373 0.005731 0.938 0.352
pop_growth.l1 0.862075 0.047600 18.111 <2e-16 ***
const 0.059493 0.053956 1.103 0.275
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.09582 on 59 degrees of freedom
Multiple R-Squared: 0.8503, Adjusted R-squared: 0.8427
F-statistic: 111.7 on 3 and 59 DF, p-value: < 2.2e-16
Covariance matrix of residuals:
gdp_growth inflation pop_growth
gdp_growth 0.0006403 0.01862 -0.0004703
inflation 0.0186233 2.49093 -0.0238060
pop_growth -0.0004703 -0.02381 0.0091824
Correlation matrix of residuals:
gdp_growth inflation pop_growth
gdp_growth 1.0000 0.4663 -0.1939
inflation 0.4663 1.0000 -0.1574
pop_growth -0.1939 -0.1574 1.0000
Then, onto forecasting!
predict(fit_VAR, n.ahead = 3)
$gdp_growth
fcst lower upper CI
[1,] 0.04827954 -0.001316983 0.09787606 0.04959652
[2,] 0.05065569 -0.004265562 0.10557695 0.05492126
[3,] 0.05195311 -0.004846277 0.10875250 0.05679939
$inflation
fcst lower upper CI
[1,] 3.000879 -0.09247298 6.094230 3.093352
[2,] 3.223890 -0.97430754 7.422087 4.198197
[3,] 3.435725 -1.34243374 8.213883 4.778159
$pop_growth
fcst lower upper CI
[1,] 0.9528018 0.7649883 1.140615 0.1878135
[2,] 0.9374872 0.6911543 1.183820 0.2463330
[3,] 0.9274756 0.6415400 1.213411 0.2859356
Another potential package is {dynlm}; see its documentation and tutorials for VAR estimation.
4 A primer on prediction
4.1 Training/testing & co
The key idea is backtesting: you have to put yourself in the shoes of someone living in the past, that is, without knowledge of the future. To do this, you need a loop that spans different dates in the past. At each point in time \(t\), you:
- extract the data available at time \(t\);
- build a predictive model;
- make some prediction based on the currently available data;
- evaluate the accuracy of the prediction thanks to the (future) realized values.
A simplified diagram below.
4.2 Some code
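Here is a minimal sketch of such a loop, on simulated toy returns, with a simple AR(1) forecaster refit at each date (every choice below, from the window size to the model, is illustrative):

```r
# Rolling backtest sketch: refit, predict, record, evaluate (toy data)
set.seed(42)
r <- as.numeric(arima.sim(model = list(ar = 0.3), n = 300, sd = 0.02))

window <- 250                                   # training window length
preds <- realized <- c()
for (t in window:(length(r) - 1)) {
  train <- r[(t - window + 1):t]                # data available at time t
  fit   <- arima(train, order = c(1, 0, 0))     # build the predictive model
  preds    <- c(preds, predict(fit, n.ahead = 1)$pred[1]) # make the forecast
  realized <- c(realized, r[t + 1])             # future realized value
}
mse <- mean((realized - preds)^2)               # out-of-sample accuracy
```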
4.3 What really matters (in finance)
Models (and people) often focus on loss (objective) functions.
In finance, we usually try to forecast returns, hence the loss function is the (root-) Mean Squared Error:
\[\text{MSE} = \frac{1}{T} \sum_{t=1}^T (r_t-\tilde{r}_t)^2, \]
where \(r_t\) is the realized return and \(\tilde{r}_t\) is the model return. Assuming predictions are unbiased \((\mathbb{E}[r_t]=\mathbb{E}[\tilde{r}_t])\), then (after some math…)
\[\text{MSE} = \mathbb{E}[(r_t - \tilde{r}_t)^2]=\mathbb{V}[r_t]+\mathbb{V}[\tilde{r}_t]-2\text{Cov}(r_t, \tilde{r}_t)\]
Question: which of these terms matter the most?
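The decomposition is straightforward to verify on toy data. Below, the means of the two series are forced to be equal so that the bias term vanishes (all numbers are simulated):

```r
# Check: MSE = V[r] + V[r_tilde] - 2 Cov(r, r_tilde) when means are equal
set.seed(42)
n <- 10000
r <- rnorm(n, sd = 0.02)                  # "realized" returns
p <- 0.4 * r + rnorm(n, sd = 0.01)        # correlated "model" returns
p <- p - mean(p) + mean(r)                # equalize means (unbiased forecast)

mse    <- mean((r - p)^2)
decomp <- var(r) + var(p) - 2 * cov(r, p) # R uses the (n-1) divisor
c(mse, decomp * (n - 1) / n)              # the two quantities coincide
```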
5 Advanced temporal models
5.1 (Feedforward) neural networks
A few images help explain how simple NNs work.
First, the structure. The network takes an input which is then processed via many multiplications, additions and activations.
At the very end, we get the output. It can be a simple number, but could also be more complex (vector, matrix, etc.).
The intermediate activation functions are there to break the linearity. A few examples below:
Next, the training. This phase is intended to determine “good” parameter values. Obviously, the “best” parameters would be ideal, but they are out of reach (the problem is too complex).
The core idea is gradient descent: we seek to minimize a loss function (e.g., the MSE) and this depends on the derivative:
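Gradient descent itself fits in a few lines. A toy sketch on a one-parameter least-squares problem (the model \(y \approx w x\) is purely illustrative):

```r
# Gradient descent: repeatedly step against the derivative of the loss
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.1)      # the "true" weight is 2

w  <- 0                                # initial parameter value
lr <- 0.1                              # learning rate (step size)
for (i in 1:200) {
  grad <- mean(-2 * x * (y - w * x))   # d(MSE)/dw
  w    <- w - lr * grad                # update rule
}
w                                      # converges near 2
```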
Now, there is a very interesting process called backpropagation that involves lots of computations of derivatives (skipped here):
In the end, the parameters (weights) of the model are locally optimal.
But they depend on the training sample: whether they work well out-of-sample (on testing sets) is another question.
If yes, the model is said to generalize well. That’s the great quest of machine learning.